Prognostic factors in diabetes: Comparison of Chi-square automatic interaction detector (CHAID) decision tree technology and logistic regression

This study aimed to develop a diabetes prediction model using the Chi-square automatic interaction detection (CHAID) decision tree and to compare its performance with that of logistic regression. After cleaning the study sample from the Korean Genome and Epidemiology Study (KoGES)-8 data, 3233 patients were included in the analysis: 589 with diabetes and 2644 without. Diabetes mellitus (DM) diagnosis prediction by logistic regression was compared with prediction through machine learning (ML) using the CHAID decision classification tree. The CHAID analysis was performed with the International Business Machines (IBM) statistical program SPSS®. The total classification accuracy of the logistic regression analysis was 81.7%, and that of the CHAID decision tree was 81.8%. The resulting diabetes diagnosis decision tree had seven terminal nodes and three depth levels. A hospital visit for a blood pressure problem was the most decisive variable for classification, and two risk levels were created for diabetes diagnosis. The suggested method is a valuable tool for predicting diabetes. Patients who visit the hospital because of blood pressure problems are more likely to develop diabetes than those undergoing treatment for hyperlipidemia. The diabetes prediction model can help doctors make decisions by detecting the possibility of diabetes early; however, diabetes cannot be diagnosed using the model alone, without the doctor’s opinion.


HYC and EYK contributed equally to this work.

This study was supported by the Korea University Ansan Hospital and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2022R1I1A1A01071220). The results and conclusions of this study do not represent overall KoGES data.

Medicine

Introduction
The world's diabetes population is estimated at more than 400 million adults, and by 2045 it is expected to reach 700 million, with more than one in 10 adults having diabetes. [1] According to the Organization for Economic Co-operation and Development (OECD) health statistics, an average of 22.7 deaths per 100,000 people were due to diabetes. [2] In Korea, the prevalence of diabetes is estimated to be 1 in 7 adults aged over 30 years, and approximately 50 million people are being treated for diabetes. [3] If diabetes can be predicted, preventive treatment becomes possible, diabetes-related complications and medical expenses can be reduced, and quality of life can improve. [4] Beginning in 2015, the Obama administration implemented the Precision Medicine Initiative Cohort Program (PMI-Cohort Program), covering cancer genome discovery and clinical application, and promoted the transition from treatment-oriented to prevention-oriented medical systems. [5] Diabetes mellitus (DM) includes type 1 diabetes, which occurs when the pancreas does not secrete insulin, and type 2 diabetes, which occurs when insulin is secreted but insulin resistance is increased. [3] Diabetes is a chronic disease accompanied by complications, including retinopathy, nephropathy, and neuropathy. Moreover, it carries multiple risks, such as cardiovascular disease, and requires continuous management, treatment, and lifestyle changes. [6] In addition, diabetes is a major risk factor for cardiovascular disease, and diabetic complications tend to increase the burden of patients' medical expenses, increase socioeconomic losses, and lead to higher mortality rates than in patients with other diseases. [7]
Many studies have assessed risk scores to select patients at high risk of diabetes. [8][9][10][11][12] For diabetes, prevention and management of high-risk groups before onset is the most effective approach. Therefore, it has become more important to identify appropriate criteria to predict the onset of diabetes and intervene in advance. [13,14] Artificial intelligence/machine learning (AI/ML), also called "learning from data" or "data-driven algorithms," finds the classification and clustering rules inherent in the data by applying feature representation and learning algorithms to the collected data. [15] Recently, modeling research using ML technology based on electronic medical record (EMR) big data analysis has advanced and has shown predictions almost similar to clinical diagnoses. [16] Based on the demographics and clinical factors of patients with diabetes, we have also developed a diabetes prediction self-measurement model that is easily accessible and can be used by the general public.
This study compared logistic regression analysis, a traditional statistical method for predicting diabetes occurrence, with the CHAID model, a data mining prediction method. In addition, we used CHAID to analyze classification and predictive models through the interaction effects and non-linearity of the explanatory variables affecting the occurrence of diabetes.

Methods
This study used data from the Korean Genome and Epidemiology Study (KoGES-8), a community-based cohort (Ansan, Anseong) of the KoGES. The KoGES data comprise "population-based" data from adults aged 50 or older and "gene-environment" data, collected to identify risk factors for gene-environment interactions in chronic diseases at the Department of Genetic Mechanics at the Korea Disease Control and Prevention Agency (KDCA). After cleaning the study sample from the KoGES-8 data, a total of 3233 patients were included: 589 with diabetes and 2644 without. For this study, data were received and analyzed according to the online procedure after approval by the institutional review board of Korea University Medical Center (KUMC IRB-2020AS0124).
Chi-square automatic interaction detection (CHAID) classification tree: A CHAID decision tree was used for diabetes diagnostic prediction modeling because it provides a graphic representation of a series of decision rules. Beginning with a root node containing all the cases, the tree branches into child nodes containing subgroups of cases. The criterion for partitioning (or branching) is selected after reviewing all possible values of all available predictor variables. At the terminal nodes, groupings of cases are obtained that are as homogeneous as possible with respect to the values of the dependent variable. [16][17][18] The algorithm determines how to optimally combine categorical or continuous variables to predict binary results based on "if-then" logic by dividing each independent variable into mutually exclusive subsets based on data homogeneity. In this study, the response variable was the presence or absence of a diabetes diagnosis. Statistical analysis using the CHAID method was performed using the CHAID node included in the International Business Machines (IBM) statistical program (SPSS).
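The split-selection step described above can be illustrated with a minimal sketch. It is not the SPSS implementation: it omits CHAID's category merging and Bonferroni adjustment, considers only two binary predictors, and uses the counts reported in the Results for Table 3. The function and variable names (`chi_square_2x2`, `best_split`, `candidates`) are hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1) for a 2x2 table.

    table = [[a, b], [c, d]], rows = predictor level ("no", "yes"),
    columns = outcome (no diabetes, diabetes).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts as reported for Table 3: [no diabetes, diabetes] for "no" and "yes".
candidates = {
    "blood_pressure_visit": [[1503, 183], [1137, 406]],
    "hyperlipidemia_treatment": [[2310, 460], [334, 129]],
}

def best_split(tables):
    """Pick the predictor whose association with the outcome has the smallest p-value."""
    scores = {name: chi_square_2x2(t) for name, t in tables.items()}
    best = min(scores, key=lambda name: scores[name][1])
    return best, scores

best, scores = best_split(candidates)
for name, (chi2, p) in scores.items():
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.3g}")
print("first split:", best)
```

Under these assumptions the blood pressure visit variable gives the stronger association and would be chosen for the first split, which agrees with the tree reported in the Results.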

Statistical analysis
For demographic and clinical characteristics, we evaluated differences between groups using chi-square or Fisher exact tests for categorical variables and the Mann-Whitney U test or Student t-test for continuous and ordered variables, as appropriate. Discrete variables were expressed as number (percentage) and continuous variables as mean and standard deviation (SD). DM diagnosis prediction by logistic regression was compared with prediction through ML using the CHAID decision classification tree. The statistical significance level was set at 0.05. IBM Statistical Package for the Social Sciences (SPSS) program ver. 25.0 (IBM Corp., Armonk, NY, USA) was used for data analysis.
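As a rough illustration of the Student t-test mentioned above, the sketch below computes the pooled-variance t statistic from summary statistics only. It uses the age summary reported in the Results (68.0 ± 8.0 vs 70.0 ± 8.0 years) and assumes that no ages are missing, so that the group sizes are 2644 and 589. The p-value uses a normal approximation, which is reasonable only because the degrees of freedom are large. The function names are hypothetical, and this is not the SPSS procedure.

```python
import math

def student_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample Student t statistic (pooled variance) from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

def two_sided_p_normal_approx(t):
    """Two-sided p-value from the standard normal distribution (large-df approximation)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Age: diabetes group vs non-diabetes group (group sizes are an assumption).
t_stat, df = student_t_from_summary(70.0, 8.0, 589, 68.0, 8.0, 2644)
p_value = two_sided_p_normal_approx(t_stat)
significant = p_value < 0.05  # significance level stated in the text

print(f"t = {t_stat:.2f}, df = {df}, significant = {significant}")
```

With these inputs the difference is significant at the 0.05 level, in line with the P < .001 reported for age.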

Results
The characteristics of the study population are summarized in Table 1. The mean age was 68.0 ± 8.0 years in the non-diabetes group and 70.0 ± 8.0 years in the diabetes group, a statistically significant difference (P < .001). Waist size was 89.5 ± 8.9 cm in patients without diabetes and 92.0 ± 9.0 cm in patients with diabetes, a significant difference between the two groups (P < .001). Weight was 60.1 ± 10.6 kg in patients without diabetes and 62.0 ± 10.5 kg in those with diabetes, also a statistically significant difference (P < .001).

Table 1. DM diagnosis and characteristics of the study population (columns: Variables, Not have diabetes, Diabetes).

Diabetes and physical factor concern
Diabetes and physical factors are shown in

Diabetes and other present disease concern
The concerns regarding diabetes and other diseases are listed in Table 3; percentages are calculated within the non-diabetes and diabetes groups, respectively. For "visit a hospital (medical institution) with blood pressure problem (last 2 years)," the "no" group comprised 1503 patients without diabetes (56.9%) and 183 patients with diabetes (31.1%), whereas the "yes" group comprised 1137 patients without diabetes (43.1%) and 406 patients with diabetes (68.9%), a statistically significant difference (P < .001). For "myocardial infarction treatment status," the "no" group comprised 2620 patients without diabetes (99.1%) and 574 with diabetes (97.5%), whereas the "yes" group comprised 24 without diabetes (0.9%) and 15 with diabetes (2.5%), a statistically significant difference (P = .001). For "congestive heart failure treatment status," the "no" group comprised 2644 patients without diabetes (100.0%) and 588 with diabetes (99.8%), whereas the "yes" group comprised a single patient, who had diabetes (0.2%), a statistically significant difference (P = .034). For "coronary artery disease treatment status," the "no" group comprised 2565 patients without diabetes (97.0%) and 556 with diabetes (94.4%), whereas the "yes" group comprised 79 without diabetes (3.0%) and 33 with diabetes (5.6%), a statistically significant difference (P = .002). For "hyperlipidemia treatment status," the "no" group comprised 2310 patients without diabetes (87.4%) and 460 with diabetes (78.1%), whereas the "yes" group comprised 334 without diabetes (12.6%) and 129 with diabetes (21.9%), a statistically significant difference (P < .001). For "renal treatment status," the "no" group comprised 2627 patients without diabetes (99.4%) and 580 with diabetes (98.5%), whereas the "yes" group comprised 17 without diabetes (0.6%) and 9 with diabetes (1.5%), a statistically significant difference (P = .030).

Logistic regression
Logistic regression analysis was performed to predict the presence of diabetes (Table 4). The model fit was adequate, with P = .305 in the Hosmer-Lemeshow test. All variables were entered into the model. Among the 2569 cases without diabetes, 2558 (99.6%) were accurately classified, and among the 569 cases with diabetes, six (1.1%) were accurately classified, for a total classification accuracy of 81.7%. For "weight gain or loss over the past month," Wald = 13.109, P < .001, and β = 0.516; the positive coefficient indicates that weight gain or loss was associated with a higher probability of diabetes. For "visit the hospital due to blood pressure problem," Wald = 76.954, P < .001, and β = 0.931; the positive coefficient indicates a higher probability of having diabetes. For "myocardial infarction treatment status," Wald = 7.839, P = .005, and β = 0.990, indicating that being treated for myocardial infarction was associated with a higher probability of having diabetes. For "coronary artery disease treatment status," Wald = 4.461, P = .035, and β = 0.490, indicating that being treated for coronary artery disease was associated with a higher probability of having diabetes. For "hyperlipidemia treatment status," Wald = 12.244, P < .001, and β = 0.431, indicating that being treated for hyperlipidemia was associated with a higher probability of diabetes. The estimated regression equation for diabetes prediction, expressed on the log-odds (logit) scale, was as follows:
logit[P(diabetes)] = -4.716 + 0.516*(weight gain/loss) + 0.931*(blood pressure visit hospital) + 0.990*(myocardial infarction-1) + 0.490*(coronary artery disease-1) + 0.431*(hyperlipidemia-1)
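The following sketch shows how the reported coefficients and classification counts can be read. It is hedged in two ways: it assumes that every predictor is coded 0/1, which the text does not state, and it includes only the terms printed in the equation, so the absolute probabilities it returns should not be taken as the fitted model's actual predictions. All names in the code are hypothetical.

```python
import math

# Intercept and coefficients as printed in the regression equation.
# Assumption: each predictor is a 0/1 indicator; unreported terms are omitted.
INTERCEPT = -4.716
COEFFICIENTS = {
    "weight_change": 0.516,
    "blood_pressure_visit": 0.931,
    "myocardial_infarction": 0.990,
    "coronary_artery_disease": 0.490,
    "hyperlipidemia": 0.431,
}

def log_odds(patient):
    """Linear predictor: intercept plus the sum of beta * x over reported terms."""
    return INTERCEPT + sum(
        beta * patient.get(name, 0) for name, beta in COEFFICIENTS.items()
    )

def probability(patient):
    """Inverse logit: converts log-odds into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-log_odds(patient)))

def odds_ratio(name):
    """Odds ratio for a one-unit change in a predictor, exp(beta)."""
    return math.exp(COEFFICIENTS[name])

def classification_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity, and specificity from a 2x2 classification table."""
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Counts reported in the text: 2558 of 2569 non-diabetes cases and
# 6 of 569 diabetes cases were classified correctly.
metrics = classification_metrics(tn=2558, fp=2569 - 2558, fn=569 - 6, tp=6)

print(f"OR for blood pressure visit: {odds_ratio('blood_pressure_visit'):.2f}")
print({key: round(value, 3) for key, value in metrics.items()})
```

Reading the classification table this way separates overall accuracy from sensitivity and specificity, which is useful here because the two outcome groups differ greatly in size.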

CHAID classification tree
The analysis was conducted using the CHAID decision tree technique to obtain the best cutoff point for diabetes (Figure 1). Diabetes diagnosis was included as the dependent variable, and demographic (age, waist size, weight, drinking water, and weight change) and clinical (blood pressure problem, myocardial infarction, coronary artery disease, hyperlipidemia, and kidney disease) variables were used as independent variables. The maximum tree depth was three, with a minimum of 100 cases in a parent node and 50 cases in a child node. The classification accuracy was 81.8%. The resulting diabetes diagnosis decision tree had seven terminal nodes and three depth levels. A hospital visit for a blood pressure problem was the most decisive variable for classification, and two risk levels were created for diabetes diagnosis. The first split was on "visit to a hospital with a blood pressure problem" ("no" or "yes"). Among patients who answered "no," 10.8% had diabetes. Within this branch, the probability of diabetes increased to 23.0% in the sub-node of patients treated for hyperlipidemia (χ² = 22.374, P < .001), and to 16.2% in the sub-node of patients not treated for hyperlipidemia who had lost weight (χ² = 11.755, P = .002). Patients who answered "yes" had a high probability of diabetes diagnosis of 26.3% (χ² = 129.792, P < .001).
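The "if-then" structure of the tree can be written out as a small sketch. It encodes only the nodes whose diabetes proportions are given in the paragraph above; the full tree has seven terminal nodes, most of which are not described in the text, so unreported branches fall back to the proportion of their parent node. The function name and argument names are hypothetical.

```python
def chaid_node_probability(bp_visit, hyperlipidemia_treated, weight_loss):
    """Diabetes proportion of the deepest reported node that a patient falls into.

    Arguments are booleans. Only nodes reported in the text are encoded;
    deeper splits under the "yes" branch are not described and are omitted.
    """
    if bp_visit:
        # Visited a hospital for a blood pressure problem in the last 2 years.
        return 0.263
    if hyperlipidemia_treated:
        # No blood pressure visit, treated for hyperlipidemia.
        return 0.230
    if weight_loss:
        # No blood pressure visit, no hyperlipidemia treatment, weight loss.
        return 0.162
    # Remaining cases: fall back to the parent "no blood pressure visit" node.
    return 0.108

# Example: a patient with no blood pressure visit who is treated for hyperlipidemia.
example = chaid_node_probability(
    bp_visit=False, hyperlipidemia_treated=True, weight_loss=False
)
print(f"estimated diabetes proportion: {example:.1%}")
```

Written this way, each path from the root to a node is a single readable rule, which is the property that makes the tree convenient for prioritizing patients.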

Discussion
The results of the study on patients diagnosed with diabetes are as follows. Based on the logistic regression prediction, weight gain/loss, visits to a medical institution with blood pressure problems, myocardial infarction, coronary artery disease, and hyperlipidemia were the main factors in predicting diabetes. First, the CHAID model showed staged interactions between risk factors through step-by-step path identification to detect diabetes. In the CHAID model, "visit to a medical institution due to a blood pressure problem" was the variable most strongly related to diabetes diagnosis and formed the first-level partition, above all other variables. Second, among patients who did not visit a medical institution due to blood pressure problems, "hyperlipidemia treatment status" was an important predictor variable, associated with a 13.2% higher incidence of diabetes. Logistic regression also identified factors for detecting diabetes, but the CHAID model readily showed multilevel interactions by presenting critical predictors in priority order. Therefore, the CHAID model is a tool that detects diabetes and supports clinician decisions, comparable in importance to logistic regression, an established research method for detecting diabetes. Blood pressure problems were identified as the most important predictor of diabetes. Patients with blood pressure problems had a 16.3% higher incidence rate of diabetes than those without, which suggests that managing blood pressure first could eventually decrease diabetes. In connection with blood pressure problems and the presence or absence of hyperlipidemia treatment, the weight gain or loss of patients not currently being treated should also be checked, which will help clinical decision making by providing a more detailed diabetes prediction. [19][20][21] These results showed a discriminant performance similar to that reported in other studies.

Table 3. Diabetes and other present disease concern (columns: Variables, Not have diabetes, Diabetes).

Unlike general analysis, CHAID decision tree analysis can readily improve predictive ability using multivariate models, even in special situations, by analyzing the interactions of various variables and applying them to the entire population. Therefore, it is possible to detect individual cases showing unique behaviors within the entire research group that cannot be identified using existing analysis methods. This result suggests that the incidence of diabetes can be easily predicted even without taking all clinical considerations for diabetes into account. However, this is also a weakness with respect to diagnosis and judgment in clinical practice, where the patient's characteristics must be considered. The model may nevertheless help doctors prioritize the classification of patients according to their risk of developing diabetes. [22]
This study has some limitations. The study sample was limited to some regions, and only Korean patients were included. Factors for diabetes management were not considered. In addition, the predictive study covered the entire disease spectrum of diabetes without distinguishing its detailed characteristics. In the future, more large-scale randomized studies are required to clarify and specify the impact of the CHAID algorithm. Finally, in the CHAID method, the number of terminal nodes tends to increase, and information overload may occur because the number of target patients in each node becomes small.
Nevertheless, predicting the occurrence of diabetes and other socially expensive and time-consuming diseases will make it possible to reduce the incidence of diabetes, its complications, and medical costs, and to improve patients' quality of life.

Table footnote: Fracture (both simple and severe fractures) of pressure for the last 2 years, Blood Problem (Hypertension or Hypotension): Visit to the hospital with blood pressure for the last 2 years, Water: increased water intake, weight gain/loss, and the treatment status for myocardial infarction, congestive heart failure, coronary artery disease, hyperlipidemia, and kidney disease.

Conclusion
We identified blood pressure problems as the most important predictor of diabetes. Patients with blood pressure problems are more likely to develop diabetes than those without, and managing blood pressure first may eventually reduce diabetes. CHAID decision tree analysis examines the interactions of various variables and applies them to the entire population, making it easy to improve predictive ability using multivariate models even in particular situations. The suggested method can readily predict diabetes incidence and detect cases within the study group that conventional analytical techniques cannot identify. Predicting the onset of diabetes and other socially expensive and time-consuming diseases will make it possible to reduce the incidence, complications, and medical costs of diabetes and improve patients' quality of life.
In the future, more extensive randomized studies are needed to clarify and refine the impact of the CHAID algorithm.